Towards a treebank for all tastes

Authors

  • Petr Jäger
  • Vladimír Petkevič
  • Alexandr Rosen
  • Hana Skoumalová
Abstract

Syntax is a discipline of many theories, and it is accordingly difficult to build a syntactically annotated corpus that would not put off at least some syntacticians by an alleged or real theoretical bias. Yet despite appearances, and despite their focus on slightly different sets of linguistic phenomena, the theories strive to describe and explain the same object: a natural language. In fact, there is a large pool of implicit wisdom shared by all syntactic theories, and a significant overlap of linguistic knowledge can be extracted from all theory-specific formats. Thus a treebank offering different views of syntactic annotation, while based on a single core pattern, need not be a dream out of touch with reality. In addition to constituency and dependency trees of various shapes, suited to the taste of experts in linguistics, one of the views may be close to the representation of syntactic structure to which Czech students are exposed at the higher elementary and secondary levels. Such a treebank should indeed be useful beyond the academic community, to other professionals and lay users interested in language and linguistics.

For most of these users, the bigger the treebank the better, though not at the cost of an unacceptable loss of reliability. Yet the largest existing treebanks reach only a relatively modest size of several million words, which is insufficient for many tasks. The reason is the cost of the manual checking needed to reduce the error rate of automatic syntactic annotation tools, which still perform much less reliably than part-of-speech taggers. However, to match the size of a balanced POS-tagged corpus, the use of automatic parsing tools without manual checking is inevitable. Building on previous efforts in treebank annotation, especially the Prague Dependency Treebank (PDT) and the NEGRA/TIGER Corpus (Hajič, 2006; Hajič et al., 1998; Skut et al., 1997, i.a.), we want to make a further step towards a large corpus with a reasonably reliable, automatically assigned syntactic annotation.

With this aim in mind, we propose an explicitly defined annotation scheme consisting of a linguistically founded, potentially underspecified morphological and syntactic core, complemented by multiple interaction shells. The shells are customizable in shape and detail according to the preferences of humans or computer applications, so that the annotation is accessible to lay users while also satisfying the demands of experts (§2).
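
The idea of a single core with several views can be illustrated by a minimal sketch. It is not the scheme proposed in the paper: the names Token, dependency_view and bracket_view, the field layout, and the example sentence are all hypothetical, and the core is assumed here to be a projective dependency structure in which a missing label stands for underspecification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    """One word of the hypothetical core annotation."""
    idx: int                     # 1-based position in the sentence
    form: str                    # word form
    head: int                    # idx of the governing token; 0 marks the root
    label: Optional[str] = None  # None means the relation is left underspecified


def dependency_view(core: List[Token]) -> List[Tuple[str, str, str]]:
    """View 1: (dependent, governor, label) triples; '?' shows underspecification."""
    return [
        (t.form, core[t.head - 1].form if t.head else "ROOT", t.label or "?")
        for t in core
    ]


def bracket_view(core: List[Token]) -> str:
    """View 2: a bracketed constituency-style string derived from the same core.

    Assumes a projective structure with exactly one root.
    """
    children: Dict[int, List[Token]] = {}
    for t in core:
        children.setdefault(t.head, []).append(t)

    def render(t: Token) -> str:
        # A head and its dependents form one bracket, kept in surface order.
        parts = sorted([t] + children.get(t.idx, []), key=lambda x: x.idx)
        if len(parts) == 1:
            return t.form
        return "[" + " ".join(p.form if p is t else render(p) for p in parts) + "]"

    return render(children[0][0])


# Hypothetical example: the relation between "dog" and "barks" is left unlabeled.
example_core = [
    Token(1, "The", 2, "det"),
    Token(2, "dog", 3),
    Token(3, "barks", 0, "root"),
]
print(dependency_view(example_core))
print(bracket_view(example_core))
```

Both functions read the same core and never modify it, so further views can be added without touching the underlying annotation.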

Similar resources

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

Towards an open-source universal-dependency treebank for Erzya

This article describes the first steps towards an open-source dependency treebank for Erzya based on Universal Dependencies (UD) annotation standards. The treebank contains 610 sentences with 6661 tokens and is based on texts from a range of open-source and public domain original Erzya sources. This ensures its free availability and extensibility. Texts in the treebank are first morphologically an...

Towards a Multi-Representational Treebank

Computational, descriptive, and theoretical linguistics use both phrase structure (PS) and dependency structure (DS) to represent syntax. We believe that the next-generation treebank should be multi-representational, designed for both representations with an automatic conversion. In this paper, we highlight the assumptions made by existing PS-to-DS and DS-to-PS conversion algorithms and show th...

Automatic error correction in a syntactic treebank using transformation-based machine learning

A treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebanks can be developed in different ways, which can generally be categorized into manual and statistical approaches. While the treebank resulting from each of these methods contains annotation errors,...

Publication date: 2011